AutoScore-Ordinal: an interpretable machine learning framework for generating scoring models for ordinal outcomes

Background Risk prediction models are useful tools in clinical decision-making which help with risk stratification and resource allocations and may lead to a better health care for patients. AutoScore is a machine learning–based automatic clinical score generator for binary outcomes. This study aims to expand the AutoScore framework to provide a tool for interpretable risk prediction for ordinal outcomes. Methods The AutoScore-Ordinal framework is generated using the same 6 modules of the original AutoScore algorithm including variable ranking, variable transformation, score derivation (from proportional odds models), model selection, score fine-tuning, and model evaluation. To illustrate the AutoScore-Ordinal performance, the method was conducted on electronic health records data from the emergency department at Singapore General Hospital over 2008 to 2017. The model was trained on 70% of the data, validated on 10% and tested on the remaining 20%. Results This study included 445,989 inpatient cases, where the distribution of the ordinal outcome was 80.7% alive without 30-day readmission, 12.5% alive with 30-day readmission, and 6.8% died inpatient or by day 30 post discharge. Two point-based risk prediction models were developed using two sets of 8 predictor variables identified by the flexible variable selection procedure. The two models indicated reasonably good performance measured by mean area under the receiver operating characteristic curve (0.758 and 0.793) and generalized c-index (0.737 and 0.760), which were comparable to alternative models. Conclusion AutoScore-Ordinal provides an automated and easy-to-use framework for development and validation of risk prediction models for ordinal outcomes, which can systematically identify potential predictors from high-dimensional data.


Introduction
Risk prediction models are mathematical equations which help clinicians estimate the probability of a healthcare outcome, given patient data. Such models include integer-point scores which can be used to predict that a disease is present (diagnostic models) or a specific outcome will occur (prognostic models), depending on the clinical question. A combination of multiple predictors Open Access † Seyed Ehsan Saffari and Yilin Ning contributed equally to this work. *Correspondence: liu.nan@duke-nus.edu.sg 1 Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore Full list of author information is available at the end of the article Page 2 of 13 Saffari et al. BMC Medical Research Methodology (2022) 22:286 (different weights for different predictors) is included into a multivariable model to calculate a risk score [1][2][3]. Some risk prediction models have been used in routine clinical settings, including the Framingham Risk Score [4], Ottawa Ankle Rules [5], Nottingham Prognostic Index [6], Gail model [7], Euro-SCORE [8], the modified Early Warning Score (MEWS) [9,10] and Simplified Acute Physiology Score [11]. The use of health information technology, particularly electronic health records (EHR), has increased in the past decade, which provides opportunities for big data research. EHR data includes detailed patient information and clinical outcome variables which can be a unique data source for risk model development [12,13]. Availability of a large number of variables in EHR data could be a mathematical challenge when using traditional regression analysis to build up a risk model. Machine learning (ML), as an alternative approach, applies mathematical algorithms to handle such big data resulting in novel risk prediction models. Traditional variable selection approaches (such as backward elimination, forward selection, stepwise selection using pre-specified stopping rules) may result in different subsets of variables in the context of EHR data, and clinical knowledge might not be always available in some clinical domains. Powerful feature selection techniques are available for supervised learning, which is a very critical aspect in risk model development when working with EHR data [13,14].
AutoScore [15] is an easy-to-use, machine learningbased automatic clinical score generator, which develops interpretable clinical scoring models. In an empirical experiment using EHR data, AutoScore generated scoring models that achieved comparable predictive performance as several conventional methods for risk model development but by using fewer variables [15]. The advantage of the AutoScore framework is the combination of efficient variable selection using ML techniques and the accessibility and interpretability of simple regression models. It can be easily used in different clinical settings and its applicability has been shown with a large number of variables (EHR data, for example) [15]. Some recent studies have used this framework to develop a risk prediction model in various clinical domains [16][17][18][19][20].
Most risk prediction models in the literature were developed using multivariable logistic regression models or ML techniques to predict a binary outcome. Aside from the AutoScore framework, ML applications include the use of Naive Bayes (NB), XGBoost, k-nearest neighbor (K-NN), multilayer perceptron, support vector machine (SVM) and CatBoost for predicting the risk of cardiovascular disease [21], random forest (RF), XGBoost, logistic regression, SVM and K-NN for the risk of incident diabetic retinopathy among patients with type 2 diabetes mellitus [22], a stroke risk prediction model using NB, decision tree and RF models [23], a XGBoost based cerebral infarction risk prediction model [24], and a developed risk model for 90-day mortality of patients undergoing gastric cancer resection with curative intent using cross validated elastic regularized logistic regression method, boosting linear regression, RF and an ensemble model [25].
Many clinical ordinal outcome variables exist and they are often dichotomized (favorable and unfavorable) or reduced to unordered categories for simplicity, e.g., in a cross-sectional study of emergency department (ED) triage [26] and a retrospective cohort study of ovarian cancer patients [27]. Nevertheless, it should not be ignored that such re-categorization results in loss of clinically and statistically relevant information, which may also involve difficulties in borderline patients (cases that can easily be categorized into either of the two levels of the outcome). One should note that analyzing ordinal variables has more statistical power in comparison to the corresponding re-categorized binary variables. This has been illustrated in both simulations and empirical studies in clinical trials [28][29][30][31][32]. Literature also recommends the use of the ordinal scale outcomes rather than dichotomization, as smaller treatment effect sizes are detectable via ordinal analysis [29,[33][34][35].
In the literature, ordinal outcome variables are discussed in several clinical domains, where the objective was either an association exploration or predictions. A large international study (including 26 hospitals from six countries) conducted ordinal logistic regression to study a composite ordinal outcome variable (defined as 1 = alive, no long length of stay [LOS], no readmission; 2 = alive, long LOS, no readmission; 3 = alive, no long LOS, readmission; 4 = alive, long LOS, readmission; 5 = death), and the correlation among different levels of the composite ordinal outcome at hospital level was reported [36]. ML methods using multiple biomarkers were performed to develop an ovarian cancer-specific predictive framework in a retrospective cohort study of 435 patients on a secondary ordinal outcome of residual tumor size (defined as: no residual tumor, < 1 cm residual tumor, ≥1 cm residual tumor), and the predictive accuracy and AUC were discussed [27]. Statistical and ML methods have been used for ordinal outcomes in the literature, e.g., the proportional odds model (POM) in middle ear dysfunction diagnosis of infants [37] and in a coronary artery disease study [38], ordinal RF in the aforementioned ovarian cancer study [27], multilayer perceptron with ordinal loss in a study across 9 mental health and suicide-related sub-Reddits [39], and 3D convulotional neural network model with ordinal binary decomposition in Parkinson's disease patients [40]. However, there is a lack of interpretability (where one may not easily understand the output of such complex and how it works, which is not recommended in healthcare domain [41]) and accessibility using these ML approaches, whereas the transparent POM is not as easily used as an interpretable risk scoring system in the clinic for real-time decision making. There is a lack of literature in model development using ordinal analysis that can be easily applied to clinical studies dealing with complex data (EHR, for example). The primary objective of this study was to expand the original AutoScore framework to provide a tool for easy development and validation of risk prediction models for ordinal outcomes. Hence the main contribution of the current study is not only the inclusion of the ordinal blocks, but also some modifications on the original AutoScore framework which leads to new methodological work and revised model performance measurements appropriate for ordinal outcomes. For illustration purpose, a risk prediction model was developed and validated using EHR data from the emergency department (as a real world data), where the ordinal outcome included three categories (alive without readmission to the hospital within 30 days post discharge, alive with readmission within 30 days post discharge and dead inpatient or within 30 days post discharge).

AutoScore-Ordinal framework
In this section we describe the 6 modules constituting the proposed AutoScore-Ordinal framework. In Module 1 (see Fig. 1) the data is first split into a training set to train prediction models, a validation set to select hyperparameters (e.g., number of variables, cut-off values for categorizing continuous variables), and a test set to evaluate the final model(s) selected. The three datasets typically contain 70%, 10% and 20% of the full dataset, respectively. Variables are ranked based on their importance to a RF [42] for multiclass classification (i.e., ignoring the ordering of categories), trained on the training set with a default number of 100 trees.
To simplify the interpretation and account for possible non-linear relationship between the predictor variables and the outcome, all continuous variables are categorized in Module 2 (see Fig. 1). To automate this process, AutoScore-Ordinal categorizes each continuous variable using the 5-th, 20-th, 80-th and 95-th percentiles (based on the training set) as cut-off values, but some cut-offs may be removed to avoid sparsity issues when the distribution of a variable is highly skewed. These (somewhat arbitrary) cut-off values provide reasonable initial configuration for subsequent score development, and can be fine-tuned by users in Module 5 (see detail below).
In Module 3 (see Fig. 1), weights associated with variables are developed using the cumulative link model   [43] with the logit link, also known as the proportional odds model (POM) [43,44], which is one of the most widely used regression models in studies of ordinal outcomes and has been integrated with deep learning approaches to handle complex (e.g., image) data [45]. Let scalar Y denote the ordinal outcome with J categories (denoted by integers 1, …, J) and column vector x denote the variables (with continuous variables readily categorized in Module 2). The POM assumes a linear model for the logit of the cumulative probabilities associated with the j-th ordinal category, i.e., The scalar terms θ j are category-specific intercept terms, where θ 1 < θ 2 < … < θ J − 1 to ensure p j < p k for any j < k. β is the vector of regression coefficients corresponding to the predictors. The negative sign before β follows from the notation used by McCullagh [43,44], such that a positive value of β indicates a positive association between x and Y, i.e., an increase in x leads to an increased probability of observing a higher category in Y. Hence an increase in x T β is always associated with increased probabilities of observing higher outcome categories, allowing us to construct prediction scores based on x T β. Another general approach for handling ordinal outcomes is ordinal binary decomposition, but it models an ordinal outcomes as several binary labels in separate models [46], making it challenging to derive a common score for the risk of being in each ordinal category.
A simple scaling and rounding of trained β values may generate a scoring model spanning negative and positive values with confusing interpretation, e.g., the arbitrary zero score may be misinterpreted as no risk. Hence, the POM is refitted after redefining reference categories in each variable such that all elements in β are positive, and β is normalised with respect to the minimum value of β. With all continuous variables readily categorised in Module 2, these normalised coefficients can be interpreted as scores associated with a category of a variable, referred to as partial scores. The partial scores (which are 0 for reference categories and 1 or larger otherwise) are rounded to positive integers to simplify the calculation of final prediction scores, which is the summation of all partial scores corresponding to the values of variables for an individual.
To facilitate interpretation, all partial scores are often rescaled (and then rounded) such that the maximum total score attenable is a meaningful value (e.g., 100).
To evaluate the performance of the final model, the prediction of outcome Y with J categories is divided into J − 1 binary classifications of Y ≤ j vs Y > j, and the mean area under the receiver operating characteristic curve (AUC) across these binary classifications (referred to as mAUC hereafter) is used to evaluate the overall performance for predicting Y, which is equivalent to the average dichotomized c-index for evaluating ordinal predictions [47,48]. In Module 4, a scoring model is grown by adding one variable at each time (based on the variable ranking from Module 1) until all candidate variables are included, and the improvement in mAUC (evaluated on the validation set) with increasing number of variables is inspected using the parsimony plot. The final list of variables is often selected when the benefit of adding a variable is small, where such small benefit could be assessed via visual inspection (by looking at parsimony plot) and clinical knowledge (and drop/include variables manually). Next, the cut-off values for continuous variables selected in Module 4 may be fine-tuned for favourable interpretation in Module 5, e.g., by using 10-year age groups instead of the arbitrarily defined quantile-based intervals. The final model is evaluated on the test set in Module 6 using the mAUC and Harrell's generalised c-index [47,49,50], which is based on the proportion of concordant pairs (i.e., when predictions and observed outcomes generate the same ranking for the pair of observations, including tied ranks) among all possible pairs of observations. For both mAUC and generalised c-index, a value of 0.5 indicates a random performance and a value of 1 indicates a perfect predictive performance. The mAUC and generalised c-index from the test set are reported with the bias-corrected 95% bootstrap confidence interval (CI) [51].

Data preparation
To demonstrate and validate our proposed AutoScore-Ordinal framework, we applied it in a clinical study in compliance with the checklist for assessment of medical AI [52]. We used AutoScore-Ordinal to predict readmission and death (composite outcome) after inpatient discharge, using data collected from patients who visited the emergency department (ED) of Singapore General Hospital in years 2008 to 2017 and were subsequently admitted to the hospital [53,54]. The full cohort included data on 449,593 ED presentation cases. Information on patient demographics, ED administration, inpatient admission, clinical tests and vital signs in ED, medical history and comorbidities was extracted from the hospital electronic health record system [16]. We excluded patients aged below 18, resulting in a final sample of 445,989 inpatient cases.
We constructed a composite ordinal outcome with three categories: alive without readmission to the We randomly split the dataset (stratified by outcome categories) into a training set of 70% (n = 312,193) cases to train models, a validation set of 10% (n = 44,599) cases to perform necessary model fine-tuning for AutoScore-Ordinal, and a test set of 20% (n = 89,197) cases to evaluate the performance of the final prediction models. For each case, we extracted the length of stay (LOS) of the previous inpatient admission (missing values were treated as 0 days). Missing values for vital signs or clinical tests were imputed using the median value in the validation set.
We compared the prediction model built using AutoScore-Ordinal with the RF (with 100 trees) and POM with LASSO or stepwise variable selection techniques. For each model, we computed the 95% CI for mAUC and generalized c-index from bootstrap samples of the test set (the number of bootstrap samples was selected as 100 for the demo purposes and can be modified in the AutoScore algorithm). Generalized c-index was computed based on the total score for AutoScoregenerated models, the linear predictor excluding intercept terms for POM and the predicted outcome categories for RF.

Implementation
All analyses were implemented in R version 4.0.5 [55]. Our proposed AutoScore-Ordinal is implemented as an R package, available from https:// github. com/ nliul ab/ AutoS core-Ordin al. POM was implemented using the clm function from package ordinal [56]. The stepAIC function from package MASS [57] was used to perform stepwise variable selection for POM, and the ordinal-Net function from package ordinalNet [58] was used to implement the LASSO approach. The RF was implemented using the randomForest function from package randomForest [59]. The bias-corrected bootstrap CI was implemented using the bca function from package coxed [60]. The generalized c-index was implemented using the rcorrcens function from package Hmisc [61].

Results
The characteristics of the full cohort are summarized in Table 1. Cases in the 3 outcome categories showed statistical difference in all variables, therefore it is non-trivial to develop a sparse prediction model based on POM.

Variable selection
The parsimony plot (see Fig. 2) suggests a reasonable model of the first 8 variables: ED LOS, creatinine, ED boarding time, number of visits in the previous year, age, systolic blood pressure (SBP), bicarbonate and pulse, which reached a mAUC that is only 7.9% lower than that the scoring model using all 41 variables. We refer to this model as Model 1. When using the parsimony plot to select variables, researchers are not restricted to consecutively select variables in the descending order of importance. For example, we built an alternative model (i.e., Model 2) with 8 variables, where we excluded the 3rd variable (i.e., ED boarding time) from Model 1 that had little impact on mAUC, and added the 14th variable (i.e., history of metastatic cancer in the past 5 years, which can be easily collected by asking the patient or the accompanying person/family/relatives) that incremented the mAUC by approximately 4% when it entered the prediction model.

Fine-tuning
All variables selected in the two models were continuous, and we fine-tuned their cut-off values in the categorization step to improve interpretability. The scoring tables after fine-tuning were shown in Table 2 for both models, and the performance of the resulting prediction models (evaluated on the test set) were reported in Table 3.

Interpreting prediction scores
The AutoScore-generated score (from Models 1 and 2) can be mapped to the likelihood of falling into different outcome categories based on the observed proportions in the training set. For example, we illustrate the use of Model 2 for risk prediction for a hypothetical new patient in Fig. 3. With values of the 8 variables measured for this new patient, clinicians can simply check relevant rows in the scoring table, summate the partial scores to a total score for this patient, and read the corresponding predicted probabilities for the three outcome categories in the lookup table. Such predicted probabilities can also be calculated from POM using a calculator or be returned from RF using designated software commands, but the checklist-style scoring table of AutoScore-generated models and the accompanying lookup tables of predicted probabilities are much easier to use in clinical practice.
We evaluate the calibration performance of Models 1 and 2, visually presented in Fig. 4. Specifically, we grouped subjects based on score intervals defined in the lookup table in Fig. 3, and plotted the observed risk of   being in each outcome category in the test set against the predicted risk (based on the lookup tables). Both Models 1 and 2 generated predicted risk similar to observed levels, indicated by dots close to the diagonal line. An increase in the scores (visually indicated by lighter color in Fig. 4) generally reflects an increased likelihood of being in a higher category in the outcome, whereas Model 2 has improved ability compared to Model 1 in differentiating different outcome categories given different predicted scores (indicated by a wider spread of dots along the diagonal line).

Comparison with other approaches
AutoScore-generated prediction models had comparable mAUC as the POM that used the same variables (see Table 3, where POM1 and POM2 correspond to Models 1 and 2 respectively). The RF using the same variables as Model 1 (see RF1 in Table 3) had a higher mAUC than Model 1, but when compared with Model 2 the advantage of the corresponding RF (see RF2 in Table 3) in terms of mAUC is less pronounced. AutoScore-generated models had slightly higher generalized c-index than the corresponding POMs, and both were higher than the corresponding RFs. In particular, the generalized c-index of RFs were much lower than the corresponding AutoScoregenerated models or POMs, due to the use of predicted labels instead of numeric scores when evaluating the performance of RF. When using traditional model building methods to build sparse POM, stepwise algorithm using AIC failed to work when starting from the null model (i.e., without any variable), and ended up selecting 35 variables when starting from the full model (i.e., including all 41 variables). Although this POM with 35 models had a high mAUC and generalized c-index (see POM (stepwise) in Table 3), it is difficult to use in practical settings. The LASSO approach selected 10 variables (i.e., ED LOS, gender, ED triage code, total number of ICU stays in the past year, admission type, SpO 2 , SBP, bicarbonate, sodium and diabetes with complications) that had much lower performance than other models (see POM (LASSO) in Table 3).

Discussion
A scoring system was developed using the AutoScore framework for ordinal outcomes in this study. The algorithm was applied on a case study to discuss the risk prediction model and its application on EHR data from the emergency department where the ordinal outcome includes three categories (alive without readmission to the hospital within 30 days post discharge, alive with readmission within 30 days post discharge and dead inpatient or within 30 days post discharge). The model was developed using 70% of the data (n = 312,193); validated on subset of 10% of the data (n = 44,599) to perform necessary model fine-tuning; and tested on a set of 20% (n = 89,197). The performance of the AutoScore-Ordinal model was checked against the alternative models including POM and RF using 100 bootstrap samples via mAUC and generalized c-index. The AutoScore-Ordinal identified two feasible scoring models with 8 variables, Table 2 Scoring table for AutoScore-generated   and both had slightly better performance than the POM and RF that use the same variables. The novelty of the AutoScore-Ordinal model is its easy-to-use and machine learning-based automatic clinical score generator features, which develops interpretable clinical scoring models and can be useful tools for clinical decision-making at different stages of clinical pathway.
Prediction models in clinical settings are useful tools to inform clinical decision-making at different stages of clinical practice [62,63]. To design, conduct and build prediction models, fundamental concepts including developing, validating and updating risk prediction models are discussed in the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual  Prognosis Or Diagnosis) Statement [64]. New risk models should always be validated to quantify the predictive ability of the model (for example, calibration and discrimination), which could be addressed via internal (bootstrapping, cross-validation, etc.) or external (independent cohort, for example) validation [64].
Most of the developed models in literature lacks of interpretability and accessibility while using machine learning techniques [26,27,39]. In contrast, the AutoScore-Ordinal via a point-based risk prediction model can be easily implemented in different clinical settings and fills a gap in interpretability, when dealing with ordinal outcomes. The advantages of the original AutoScore framework [15] applies to the AutoScore-Ordinal framework. AutoScore-Ordinal builds on the POM, which is suitable for analyzing ordinal outcomes and widely used in clinical and epidemiological research. Compared to conventional use of POM, AutoScore-Ordinal makes use of machine learning methods to build sparse prediction models with good prediction performance, whereas traditional approaches such as stepwise variable selection and LASSO may not work well. AutoScore-Ordinal creates a check-list style scoring model that is easily implemented in clinical settings. In clinical research, sometimes quantitative data are categorized as ordinal variables due to different reasons such as skewness or multi-modal distribution. Under such scenarios, dichotomization may not be ideal and could result in loss of clinically and statistically relevant information. One may take advantage of the AutoScore-Ordinal framework to deal with such ordinal outcome variables.
AutoScore-Ordinal provides an efficient, straightforward and flexible variable selection procedure based on the parsimony plot, which visually presents the improvement in model performance with a growing number of variables in the model. Intuitively, researchers can select the top few variables that correspond to a satisfying model performance and inclusion of an additional variable results in a small (e.g., < 1%) improvement, which resulted in Model 1 in our example. In addition, AutoScore-Ordinal allows researchers to manually add or remove variables from the final variables based on their contribution to model performance (e.g., as illustrated in Model 2) or practical implications. While the current AutoScore-Ordinal implementation uses the POM (or more generally the cumulative link model with the logit link) that is widely used in clinical applications, it can be used with other link functions (e.g., probit, complementary log-log) with minor modifications for possible improvements in model fit. Researchers may want to draw multiple parsimony plots to select a link function that best suits the data when determining variables to include in the final model. In our data example we trained RF with 100 trees when ranking variables in Module 1 of AutoScore-Ordinal and when using it as a prediction model. Researchers may want to increase the number of trees to improve performance in general applications, e.g., 500 trees is a common choice [65]. Due to the large sample size of our case study, we run out of memory when training an RF with 500 trees, and an RF with 200 trees generated comparable results when ranking variables and predicting ordinal outcomes.
As indicated by the name, POM assumes proportional odds, i.e., the effect of each variable on the outcome is the same across outcome categories. In univariable POM analyses of the training set (without categorizing continuous variables), the proportional odds assumption was rejected for all variables (with significance level of 5%). Future study should investigate how to relax this assumption when necessary without considerably complicating the interpretation of the resulting scoring model. Despite this, the two prediction models built using AutoScore-Ordinal worked reasonably well. For performance evaluation, we considered two metrics (i.e., mAUC and generalized c-index) that have straightforward interpretation and similar definition with metrics for binary and survival predictions [47,48,50]. Future work may consider other performance metrics, e.g., volume under the receiver operating characteristic surface (more generally the hypervolume under the manifold) [66] and the ordinal c-index [47] for ordinal prediction, or the M-index [67] and polytomous discrimination index [68,69] for multi-class outcomes without explicitly accounting for ordering of categories.
Our data example aims to illustrate the use of our proposed AutoScore-Ordinal framework. The prediction performance can be improved, e.g., although Model 2 had better performance than Model 1, it will most likely fail to predict any new case into category 2, as this category is dominated by the other two categories (see lookup table in Fig. 3). The AutoScore-Ordinal should be applied in other clinical domains with different sample sizes and various number of variables to establish external validity. Further investigation is required to improve performance before applying the AutoScore-Ordinalderived scoring models in clinical settings, e.g., inclusion of additional relevant variables, alternative imputation of missing values and cross-validation feature within the package. Another future research direction, as seen in the literature [70][71][72][73], is to integrate the AutoScore-Ordinal package as a mobile application where it could be easily accessible to the clinicians. Nonetheless, AutoScore-Ordinal provides a powerful, flexible and easy-to-use framework for developing interpretable scoring models for ordinal clinical outcomes.

Conclusion
AutoScore-Ordinal as a risk prediction model was developed for ordinal outcome variable. For illustration purpose, the framework was implemented and validated using EHR data from the emergency department, where the ordinal outcome included three categories (alive without readmission to the hospital within 30 days post discharge, alive with readmission within 30 days post discharge and dead inpatient or within 30 days post discharge). An efficient and flexible variable selection procedure was explained and the model indicated a comparable goodness-of-fit in compared to the alternative models. The point-based risk prediction model generated by the AutoScore-Ordinal is easy to implement and interpret in different clinical settings.